How to Get the Most out of Your Curation Effort
Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows providing each annotation with a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing certainty-level for individual annotations, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance, based on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
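The idea of attaching a posterior probability to each annotation can be illustrated with a minimal sketch. This is not either of the two probabilistic models from the paper, whose details the abstract does not give. It assumes annotators vote independently, that each annotator has a single known accuracy, and that errors are spread uniformly over the remaining labels. The function name `annotation_posteriors`, its parameters, and the accuracy values are all hypothetical.

```python
from math import prod


def annotation_posteriors(votes, accuracies, labels, prior=None):
    """Return the posterior probability of each candidate label.

    Assumptions (illustrative, not taken from the source):
    - annotators vote independently of one another;
    - annotator j picks the true label with probability accuracies[j];
    - otherwise annotator j picks uniformly among the other labels.
    """
    k = len(labels)
    if prior is None:
        # Uniform prior over candidate labels.
        prior = {label: 1.0 / k for label in labels}

    unnormalized = {}
    for candidate in labels:
        # Likelihood of the observed votes if `candidate` were the true label.
        likelihood = prod(
            acc if vote == candidate else (1.0 - acc) / (k - 1)
            for vote, acc in zip(votes, accuracies)
        )
        unnormalized[candidate] = prior[candidate] * likelihood

    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Worst case: three annotators, three different labels.
posteriors = annotation_posteriors(
    votes=["A", "B", "C"],
    accuracies=[0.9, 0.7, 0.6],  # hypothetical annotator accuracies
    labels=["A", "B", "C"],
)
best = max(posteriors, key=posteriors.get)
```

Under these assumptions, the label chosen by the most reliable annotator receives a posterior well above the 1/3 that a random pick would give, even though all three annotators disagree.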
Automatically Identifying Gene/Protein Terms in MEDLINE Abstracts
Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein–protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation. Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts. Availability. A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/
Summary of subsonic-diffuser data
Subsonic-diffuser data - exit flow, boundary-layer control, and inlet velocit
SplicePort: An interactive splice-site analysis tool
SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters and frequency plot WebLogos can be generated for each cluster. The user can browse the identified clusters and their contributing elements, looking for new interesting signals, or can validate previously observed signals. The SplicePort web server can be accessed at http://www.cs.umd.edu/projects/SplicePort and http://www.spliceport.org
Stereospecific aliphatic hydroxylation upon photoreduction of iron (III)
Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/22203/1/0000634.pd
Comprehensively identifying Long Covid articles with human-in-the-loop machine learning
A significant percentage of COVID-19 survivors experience ongoing
multisystemic symptoms that often affect daily living, a condition known as
Long Covid or post-acute-sequelae of SARS-CoV-2 infection. However, identifying
scientific articles relevant to Long Covid is challenging since there is no
standardized or consensus terminology. We developed an iterative
human-in-the-loop machine learning framework combining data programming with
active learning into a robust ensemble model, demonstrating higher specificity
and considerably higher sensitivity than other methods. Analysis of the Long
Covid collection shows that (1) most Long Covid articles do not refer to Long
Covid by any name, (2) when the condition is named, the name used most
frequently in the literature is Long Covid, and (3) Long Covid is associated
with disorders in a wide variety of body systems. The Long Covid collection is
updated weekly and is searchable online at the LitCovid portal:
https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovi